Mixture of latent words language models for domain adaptation
Abstract
This paper introduces a novel language model (LM) adaptation method based on a mixture of latent words language models (LWLMs). LMs are often constructed as a mixture of n-gram models whose mixture weights are optimized on target-domain data. However, n-gram mixture modeling is not flexible enough for domain adaptation because model merging is performed in the observed word space. Since the words in out-of-domain LMs often differ from those in the target-domain LM, it is hard for out-of-domain LMs to offer adequate adaptation performance. Our solution is to perform model merging in a latent variable space created from LWLMs. The latent variables in LWLMs are represented as specific words selected from the observed word space, so multiple LWLMs can share a common latent variable space, and mixture modeling can take that latent space into account. This paper also describes a method for estimating the mixture weights of the LWLM mixture, using a sampling technique based on the Bayesian criterion in place of the conventional expectation-maximization (EM) algorithm. Our experiments show that LWLM mixture modeling is more effective than n-gram mixture modeling.
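To make the contrast concrete, here is a minimal Python sketch of the three ideas in the abstract: interpolation in the observed word space, interpolation in a shared latent word space, and Bayesian sampling of the mixture weights in place of EM. The toy unigram components, the emission model `emit`, and the Gibbs-style weight sampler are illustrative stand-ins under stated assumptions, not the paper's actual models.

```python
import random

# Toy smoothed unigram LM; in the paper the components are full LWLMs or
# n-gram LMs, so this stand-in only illustrates the interface.
def make_unigram_lm(counts):
    vocab_size = len(counts) + 1          # +1 for unseen words
    total = sum(counts.values())
    return lambda w: (counts.get(w, 0) + 0.5) / (total + 0.5 * vocab_size)

# Conventional n-gram mixture: merging happens in the observed word space,
# so components whose vocabularies barely overlap contribute little.
def ngram_mixture_prob(word, lms, weights):
    return sum(lam * lm(word) for lam, lm in zip(weights, lms))

# LWLM-style mixture: because LWLM latent variables are themselves words,
# all components can share one latent vocabulary, and merging is done over
# latent words h before a shared emission model P(w | h) is applied.
def lwlm_mixture_prob(word, latent_vocab, latent_lms, emit, weights):
    return sum(
        sum(lam * lm(h) for lam, lm in zip(weights, latent_lms)) * emit(word, h)
        for h in latent_vocab
    )

# Bayesian sampling of mixture weights (a generic Gibbs-style sketch, not
# necessarily the paper's exact sampler): draw a component assignment for
# each adaptation token from its posterior, then re-estimate weights from
# the assignment counts plus a Dirichlet prior, instead of running EM.
def sample_mixture_weights(tokens, lms, iters=100, alpha=1.0, seed=0):
    rng = random.Random(seed)
    k = len(lms)
    weights = [1.0 / k] * k
    for _ in range(iters):
        counts = [alpha] * k              # Dirichlet pseudo-counts
        for w in tokens:
            post = [lam * lm(w) for lam, lm in zip(weights, lms)]
            r = rng.random() * sum(post)
            acc, comp = 0.0, k - 1
            for i, p in enumerate(post):
                acc += p
                if r <= acc:
                    comp = i
                    break
            counts[comp] += 1
        total = sum(counts)
        weights = [c / total for c in counts]
    return weights
```

The point of the latent-space version is that `latent_vocab` is common to all components, so even components with little observed-word overlap can still be mixed meaningfully.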
Similar Papers
Unsupervised Language Model Adaptation Incorporating Named Entity Information
Language model (LM) adaptation is important for both speech and language processing. It is often achieved by combining a generic LM with a topic-specific model that is more relevant to the target document. Unlike previous work on unsupervised LM adaptation, this paper investigates how effectively using named entity (NE) information, instead of considering all the words, helps LM adaptation. We ...
Novel weighting scheme for unsupervised language model adaptation using latent Dirichlet allocation
A new approach for computing the weights of topic models in language model (LM) adaptation is introduced. We form topic clusters with a hard-clustering method that assigns each document to the single topic from which the largest number of its words is drawn in a latent Dirichlet allocation (LDA) analysis. The new weighting idea is that the unigram count of the topic generated by hard-clus...
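A minimal sketch of the hard-clustering step described in the abstract above, plus one plausible reading of the count-based weighting; `doc_topics` and `docs` are hypothetical LDA outputs, and the abstract is truncated before the exact weighting formula.

```python
from collections import Counter

# doc_topics[d] is the list of per-token topic assignments LDA produced for
# document d; docs[d] is its token list. Both are hypothetical inputs.
def hard_cluster_docs(doc_topics):
    # Assign each document to the topic that generated most of its tokens.
    return [Counter(ts).most_common(1)[0][0] for ts in doc_topics]

def topic_weights_from_unigram_counts(docs, clusters, num_topics):
    # Pool unigram counts per hard cluster; their relative mass then serves
    # as the interpolation weight of each topic LM (one plausible reading,
    # since the truncated abstract stops before the exact formula).
    counts = [Counter() for _ in range(num_topics)]
    for words, t in zip(docs, clusters):
        counts[t].update(words)
    totals = [sum(c.values()) for c in counts]
    grand = sum(totals) or 1
    return counts, [t / grand for t in totals]
```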
Lemmatized Latent Semantic Model for Language Model Adaptation of Highly Inflected Languages
We present a method to adapt statistical N-gram models for large vocabulary continuous speech recognition of highly inflected languages. The method combines morphological analysis, latent semantic analysis (LSA) and fast marginal adaptation for building topic-adapted trigram models, based on a background language model and very short adaptation texts. We compare words, lemmas and morphemes as b...
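The abstract above names fast marginal adaptation without stating the formula. The sketch below uses the standard scaling form (topic-to-background unigram ratio raised to a tuning exponent `beta`), which may differ in detail from the authors' variant.

```python
# Fast marginal adaptation in its common form:
#   P_adapt(w | h)  ∝  (P_topic(w) / P_bg(w)) ** beta  *  P_bg(w | h)
# The exponent and normalization here are assumptions, not taken from the
# abstract; p_bg_cond, p_bg_uni, p_topic_uni are hypothetical callables.
def fast_marginal_adapt(p_bg_cond, p_bg_uni, p_topic_uni, vocab, beta=0.5):
    def scale(w):
        return (p_topic_uni(w) / max(p_bg_uni(w), 1e-12)) ** beta
    def adapted(word, history):
        # Normalize over the vocabulary so the result is a distribution.
        z = sum(scale(v) * p_bg_cond(v, history) for v in vocab)
        return scale(word) * p_bg_cond(word, history) / z
    return adapted
```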
Dynamic Language Model Adaptation Using Latent Topical Information and Automatic Transcripts
This paper investigates dynamic language model adaptation for Mandarin broadcast news recognition. A topical mixture model is presented to dynamically explore the long-span latent topical information for language model adaptation. The underlying characteristics and different kinds of model structures are extensively investigated, and their performance is verified by comparison with the con...
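A generic form of a topical mixture LM with dynamically updated topic posteriors, sketched under the assumption that the posteriors are re-estimated from the decoded history; the truncated abstract does not specify the authors' exact model structure.

```python
import math

# P(w | H) = sum_t P(w | t) * P(t | H), where the topic posterior P(t | H)
# is recomputed from the long-span decoding history H.
# topic_word: list of dicts mapping word -> P(w | t); topic_prior: list of
# prior topic probabilities; history_counts: word counts decoded so far.
def topical_mixture_prob(word, history_counts, topic_word, topic_prior):
    # Topic posterior from the history likelihood, in log-space to avoid
    # underflow on long histories.
    log_post = []
    for t, prior in enumerate(topic_prior):
        ll = math.log(prior)
        for w, c in history_counts.items():
            ll += c * math.log(topic_word[t].get(w, 1e-9))
        log_post.append(ll)
    m = max(log_post)
    post = [math.exp(lp - m) for lp in log_post]
    z = sum(post)
    return sum((p / z) * topic_word[t].get(word, 1e-9)
               for t, p in enumerate(post))
```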
Latent Domain Phrase-based Models for Adaptation
Phrase-based models directly trained on mix-of-domain corpora can be sub-optimal. In this paper we equip phrase-based models with a latent domain variable and present a novel method for adapting them to an in-domain task represented by a seed corpus. We derive an EM algorithm which alternates between inducing domain-focused phrase pair estimates, and weights for mix-domain sentence pairs reflec...
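A deliberately simplified sketch of the alternating EM idea in the abstract above: posteriors over a latent in-domain component for each mix-domain sentence pair, seeded by an in-domain model. `score_in` and `score_mix` are hypothetical per-pair log-likelihood functions; in the full method the M-step would also re-estimate domain-focused phrase-pair probabilities from these posteriors.

```python
import math

def em_domain_posteriors(pairs, score_in, score_mix, iters=20, prior=0.5):
    post = [prior] * len(pairs)
    for _ in range(iters):
        # E-step: posterior P(in-domain | pair) under the current prior.
        post = []
        for p in pairs:
            a = prior * math.exp(score_in(p))
            b = (1.0 - prior) * math.exp(score_mix(p))
            post.append(a / (a + b))
        # M-step: re-estimate the latent-domain prior (phrase-pair
        # estimates would be re-weighted by `post` here in the full model).
        prior = sum(post) / len(post)
    return post, prior
```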
Publication date: 2014